- 
An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic concept space with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs. Free, publicly-accessible full text available April 11, 2026.
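The entropy-based decomposition that this abstract builds on can be sketched in a few lines: treat each paraphrase's answer distribution as one sample, take the entropy of the averaged distribution as total uncertainty, the average per-paraphrase entropy as aleatoric uncertainty, and the gap between them as the epistemic (paraphrase-disagreement) component. This is a minimal sketch of the standard baseline, not the paper's improved metric; the function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def decompose_uncertainty(paraphrase_dists):
    """Entropy-based decomposition over answer distributions collected
    from several paraphrases of one prompt.

    total     = H(mean distribution)       (predictive uncertainty)
    aleatoric = mean of per-paraphrase H   (within-paraphrase noise)
    epistemic = total - aleatoric          (disagreement across paraphrases)
    """
    k = len(paraphrase_dists[0])
    mean = [sum(d[i] for d in paraphrase_dists) / len(paraphrase_dists)
            for i in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(d) for d in paraphrase_dists) / len(paraphrase_dists)
    return total, aleatoric, total - aleatoric
```

Two paraphrases that answer deterministically but oppositely yield zero aleatoric and maximal epistemic uncertainty, which is exactly the prompt-sensitivity signal the abstract describes.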
- 
Lautarescu, Alexandra (Ed.) We investigate how select identity characteristics moderate the role of several SDoH domains on major depressive disorder (MDD). Our study considers an analytical sample of 86,954 participants from the NIH-funded All of Us (AoU) Research Program in the USA. Our independent variables and moderators come from survey responses and our outcome is an EHR diagnostic code. We include race/ethnicity and gender/sexual identity to moderate the role of food insecurity, discrimination, neighborhood social cohesion, and loneliness in assessing risk for MDD diagnosis. We examine those moderating effects based on connections seen in the literature. Our findings illustrate that the complexity of where and how people live their lives can have a significant differential impact on MDD. Women (AOR = 1.60, 95% CI = [1.53, 1.68]) and LGBTQIA2+ individuals (AOR = 1.71, 95% CI = [1.60, 1.84]) exhibit a significantly higher likelihood of MDD diagnosis compared to cisgender heterosexual males. Our study also reveals a lower likelihood of MDD diagnosis among Asian/Asian American individuals (AOR = 0.41, 95% CI = [0.35, 0.49]) compared to White individuals. Our results align with previous research indicating that higher levels of food insecurity (AOR = 1.30, 95% CI = [1.17, 1.44]) and loneliness (AOR = 6.89, 95% CI = [6.04, 7.87]) are strongly associated with an increased likelihood of MDD. However, we also find that social cohesion (AOR = 0.92, 95% CI = [0.81, 1.05]) does not emerge as a significant predictor, contradicting some literature emphasizing the protective role of neighborhood cohesion. Similarly, our finding that transience (AOR = 0.95, 95% CI = [0.92, 0.98]) reduces the likelihood of MDD diagnosis contradicts conventional wisdom and warrants further exploration. Our study provides a reminder of the substantial challenges facing research focused on marginalized community segments, and of the need for deliberate sampling plans to examine those most marginalized and underserved.
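The adjusted odds ratios (AORs) reported above come from logistic regression: each AOR is the exponential of a fitted coefficient, and its Wald confidence interval is the exponential of the coefficient plus or minus a critical value times its standard error. A minimal sketch of that conversion, with a hypothetical helper name (this is the standard transformation, not the study's modeling code):

```python
import math

def adjusted_odds_ratio(beta, se, z=1.96):
    """Convert a logistic-regression coefficient and its standard error
    into an odds ratio with a Wald 95% confidence interval.

    beta : fitted log-odds coefficient
    se   : standard error of beta
    z    : critical value (1.96 for a 95% CI)
    """
    aor = math.exp(beta)
    ci = (math.exp(beta - z * se), math.exp(beta + z * se))
    return aor, ci
```

For example, the loneliness AOR of 6.89 corresponds to a log-odds coefficient of roughly ln(6.89) ≈ 1.93; an AOR whose CI spans 1.0, as with social cohesion above, is not statistically significant.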
- 
When managing natural systems, the importance of recognizing the role of uncertainty has been formalized as the precautionary approach. However, it is difficult to determine the role of stochasticity in the success or failure of management because there is almost always no replication; typically, only a single observation exists for a particular site or management strategy. Yet, assessing the role of stochasticity is important for providing a strong foundation for the precautionary approach, and learning from past outcomes is critical for implementing adaptive management of species or ecosystems. In addition, adaptive management relies on being able to implement a variety of strategies in order to learn—an often difficult task in natural systems. Here, we show that there is large, stochastically driven variability in success for management treatments to control an invasive species, particularly for moderate, and more feasible, management strategies. This is exactly where the precautionary approach should be important. Even when combining management strategies, we show that moderate effort in management either fails or is highly variable in its success. This variability allows some management treatments to, on average, meet their target, even when failure is probable. Our study is an important quantitative replicated experimental test of the precautionary approach and can serve as a way to understand the variability in management outcomes in natural systems, which have the potential to be more variable than our tightly controlled system.
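The core point—that identical moderate-effort treatments can succeed or fail purely by chance—can be illustrated with a toy Monte Carlo replication. The model below is a hypothetical sketch (multiplicative population growth with lognormal noise, proportional removal each step), not the study's experimental system; all parameter values are illustrative assumptions.

```python
import math
import random

def simulate_control(effort, growth=1.5, sd=0.5, n0=100.0,
                     target=10.0, steps=20, rng=None):
    """One stochastic replicate of an invasive-species control program.

    Each step the population grows multiplicatively with lognormal noise,
    then a fraction `effort` is removed. Success means the final
    population is at or below `target`.
    """
    rng = rng or random.Random(0)
    n = n0
    for _ in range(steps):
        n *= growth * math.exp(rng.gauss(0.0, sd))  # stochastic growth
        n *= (1.0 - effort)                          # management removal
    return n <= target

def success_rate(effort, reps=500, seed=1):
    """Fraction of replicates in which the control target is met."""
    rng = random.Random(seed)
    return sum(simulate_control(effort, rng=rng) for _ in range(reps)) / reps
```

Running `success_rate` across effort levels shows the pattern the abstract describes: very low and very high effort give near-certain outcomes, while intermediate effort produces replicate-to-replicate variability—exactly the regime where unreplicated field observations are least informative.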
- 
Background: Evaluation studies frequently draw on fallible outcomes that contain significant measurement error. Ignoring outcome measurement error in the planning stages can undermine the sufficiency and efficiency of an otherwise well-designed study and can further constrain the evidence studies bring to bear on the effectiveness of programs. Objectives: We develop simple formulas to adjust statistical power, minimum detectable effect (MDE), and optimal sample allocation formulas for two-level cluster- and multisite-randomized designs when the outcome is subject to measurement error. Results: The resulting adjusted formulas suggest that outcome measurement error typically amplifies treatment effect uncertainty, reduces power, increases the MDE, and undermines the efficiency of conventional optimal sampling schemes. Therefore, achieving adequate power for a given effect size will typically demand increased sample sizes when considering fallible outcomes, while maintaining design efficiency will require increasing portions of a budget be applied toward sampling a larger number of individuals within clusters. We illustrate evaluation planning with the new formulas while comparing them to conventional formulas using hypothetical examples based on recent empirical studies. To encourage adoption of the new formulas, we implement them in the R package PowerUpR and in the PowerUp software.
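The intuition behind these adjustments follows standard reliability-attenuation logic: with outcome reliability ρ, the observed standardized effect shrinks to δ√ρ, so the sample size needed to detect it inflates by roughly 1/ρ. The sketch below illustrates this for a simple two-arm comparison of means; it is a simplified single-level approximation under that assumption, not the paper's multilevel formulas or the PowerUpR implementation.

```python
import math

def attenuated_effect(delta, reliability):
    """Standardized effect observable with a fallible outcome:
    delta_obs = delta * sqrt(rho), where rho is outcome reliability."""
    return delta * math.sqrt(reliability)

def n_per_group(delta, reliability=1.0, alpha_z=1.96, power_z=0.84):
    """Approximate per-group n for a two-arm comparison of means at
    80% power and two-sided alpha = .05, inflated for reliability.

    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta_obs^2,
    which scales as 1/rho relative to the error-free case.
    """
    d = attenuated_effect(delta, reliability)
    return math.ceil(2 * (alpha_z + power_z) ** 2 / d ** 2)
```

For a true effect of 0.5, a perfectly reliable outcome needs about 63 per group, while reliability of 0.7 pushes that to about 90—the same direction of adjustment the abstract reports for the cluster- and multisite-randomized cases.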
 An official website of the United States government